Abstract.

In 2014 we ran a very successful machine learning challenge in High Energy
physics attracting 1785 teams, which exposed the machine learning
community for the first time to the problem of “learning to discover”
(www.kaggle.com/c/higgs-boson). While physicists had the opportunity
to improve on the state-of-the-art using “feature engineering” based on
physics principles, this was not the determining factor in winning the
challenge. Rather, the challenge revealed that the central difficulty of the
problem is to develop a strategy to optimize directly the Approximate Median
Significance (AMS) objective function, which is a particularly challenging
and novel problem. This objective function aims at increasing the power
of a statistical test. The top ranking learning machines span a variety of
techniques including deep learning and gradient tree boosting. This paper
presents the problem setting and analyzes the results.

Increasingly, machine learning scientists are getting away from canonical
problems of classification and regression and are venturing into new domains by
formulating new tasks as learning problems. In recent years we have seen, for
example, the tasks of learning to rank [1] and learning to recommend [2], which
have become pervasive in applications. In this paper, we tackle a new machine
learning task addressing the problem of evaluating the significance of a
scientific discovery: learning to discover. Superficially, this problem is a two-class
classification problem separating events of interest (called signals) that have not
been encountered or characterized before in nature (but were predicted by a
theory) from events produced by already observed processes (called backgrounds).
However, the problem setting differs from regular classification problems in two
respects: (1) Discovery: Since signals were never observed before in real data,
no labeled training examples from real data are available. Rather, simulated
data (from a simulator implementing theoretical predictions) can be produced
to generate training data. The learning machine can then address the “inverse
problem” of predicting which events are signals in real data. (2) Evaluation:
Typically, the number of signals is expected to be orders of magnitude smaller
than the number of backgrounds. Hence, the statistical significance of the
detection of a number of signal events must be assessed to claim a discovery. For this
reason, the evaluation function of the classifier is a metric of a statistical test.

The Discovery-Evaluation criteria define a wide scope that has previously
been addressed mainly by Neyman-Pearson learning [3]. In this paper, we
examine a new case of the learning to discover problem, the discovery of new
particles, that leads to a different framework. The problem has been used
recently in a comparison of deep vs. shallow representation learning [4]. With
the HiggsML Machine Learning challenge organized in 2014 1 [5], it becomes a
benchmark problem. The data were released at http://opendata.cern.ch/
after the end of the challenge. This unprecedented disclosure of precious data
belonging to the ATLAS collaboration highlights the importance of the learning
to discover task. The dataset and the subject of the challenge correspond to a
particular physical process, the $H \to \tau^+\tau^-$ channel. However, the methodology
is fully generic to the discovery of a new particle, and could generalize to other
discovery settings.

From the algorithmic point of view, the novelty of the problem mainly comes
from its exotic objective function, the Approximate Median Significance (AMS).
The AMS presents several undesirable features for training a learning machine: it is
discontinuous (a discontinuity arises for each sample); it is non-differentiable;
it is non-additive (the overall AMS is not the sum of individual contributions of
the samples); it uses sample weights available only for training. This seems to
have drawn a lot of interest because off-the-shelf packages do not directly support
the optimization of non-standard objective functions and the participants saw
an opportunity to make novel contributions and distinguish themselves from the
rest of the crowd. The top ranking challenge participants greatly improved the
performance over baseline methods.

1 The physics problem 

The ATLAS and the CMS experiments jointly claimed the discovery of the Higgs
boson [6, 7] in 2012. The Higgs boson has many different processes (called
channels by physicists) through which it can decay, that is, produce other particles.
Beyond the initial discovery, the study of all modes of decay increases confidence
in the validity of the theory and helps characterize the new particle. The Higgs
boson was first seen in three distinct decay channels which are all boson pairs.
One of the next important topics is to seek evidence on the decay into fermion
pairs, namely tau-leptons or b-quarks, and to precisely measure their
characteristics. The first evidence of the H to tau tau channel was recently reported by
the ATLAS experiment [8]. The aim of the challenge is to increase the statistical
significance of this discovery.

Discovery and characterization rely on experiments that run at the Large
Hadron Collider (LHC) at CERN. Hundreds of millions of proton-proton
collisions per second are produced. The particles resulting from each bunch crossing
are detected by sensors and filtered in real-time. For each collision, the raw
data produced by the sensors are ultimately digested into a vector of features
containing up to some tens of real-valued variables. This vector is called an event.

The vast majority of events represent known (background) processes: they
are mostly produced by processes which are exotic in everyday terms, but known,
having been discovered in previous generations of experiments. The learning
problem is to find a region in feature space with a significant excess of signal
events. Discovery of a new particle ultimately boils down to classical statistical
testing. The null hypothesis is that the experiment produces only background
events, and the alternative hypothesis is that it produces some signal events.

1 www.kaggle.com/c/higgs-boson and https://higgsml.lal.in2p3.fr

2 The AMS objective function 

For the formal description of the Challenge, let $\mathcal{D} = \{(x_1, y_1, w_1), \ldots, (x_n, y_n, w_n)\}$
be the training sample, where $x_i \in \mathbb{R}^d$ is a d-dimensional feature vector, $y_i \in
\{\mathrm{b}, \mathrm{s}\}$ is the label, and $w_i \in \mathbb{R}^+$ is a non-negative weight. Let $\mathcal{S} = \{i : y_i = \mathrm{s}\}$
and $\mathcal{B} = \{i : y_i = \mathrm{b}\}$ be the index sets of signal and background events,
respectively, and let $n_\mathrm{s} = |\mathcal{S}|$ and $n_\mathrm{b} = |\mathcal{B}|$ be the numbers of simulated signal and
background events 2 3.

There are two properties that make our simulated set different from those
collected in nature or sampled in a natural way from a joint distribution $p(x, y)$.
First, as many events of the signal class as needed can be simulated (within a
computational budget), so the proportion $n_\mathrm{s}/n_\mathrm{b}$ of the number of points in the
two classes does not have to reflect the proportion of the prior class probabilities
$P(y = \mathrm{s})/P(y = \mathrm{b})$. This is actually a good thing: since $P(y = \mathrm{s}) \ll P(y = \mathrm{b})$,
the training sample would be very unbalanced if the numbers of signal and
background events, $n_\mathrm{s}$ and $n_\mathrm{b}$, were proportional to the prior class probabilities
$P(y = \mathrm{s})$ and $P(y = \mathrm{b})$. Second, the simulator produces importance-weighted
events. Since the objective function (3) will depend on the unnormalized sum of
weights, to make the setup invariant to the numbers of simulated events $n_\mathrm{s}$ and
$n_\mathrm{b}$, the sum across each set (training, public test, private test, etc.) and each
class (signal and background) is kept fixed, that is,

$$\sum_{i \in \mathcal{S}} w_i = N_\mathrm{s} \quad \text{and} \quad \sum_{i \in \mathcal{B}} w_i = N_\mathrm{b}. \tag{1}$$

The normalization constants $N_\mathrm{s}$ and $N_\mathrm{b}$ have physical meanings: they are the
expected total number of signal and background events, respectively, during the
time interval of data taking (the year of 2012 in our case). The individual weights
are proportional to conditional density ratios:

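$$w_i \propto \begin{cases} p_\mathrm{s}(x_i)/q_\mathrm{s}(x_i), & \text{if } y_i = \mathrm{s}, \\ p_\mathrm{b}(x_i)/q_\mathrm{b}(x_i), & \text{if } y_i = \mathrm{b}, \end{cases} \tag{2}$$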

where $p_\mathrm{s}(x_i) = p(x_i | y = \mathrm{s})$ and $p_\mathrm{b}(x_i) = p(x_i | y = \mathrm{b})$ are the conditional
signal and background densities, respectively, and $q_\mathrm{s}(x_i)$ and $q_\mathrm{b}(x_i)$ are
instrumental densities used by the simulator.
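
The normalization of Eq. (1) can be illustrated with a minimal sketch. The function name and the string labels "s" and "b" are our own illustrative choices, not part of the challenge software, and the target totals passed in are placeholders for $N_\mathrm{s}$ and $N_\mathrm{b}$. The sketch rescales the weights of any subset (for example a held-out fold) so that each class again sums to its target.

```python
def normalize_weights(labels, weights, n_s_expected, n_b_expected):
    """Rescale weights so that each class sums to its target, as in Eq. (1).

    labels  : sequence of "s" (signal) or "b" (background); a hypothetical encoding
    weights : sequence of non-negative importance weights
    """
    sum_s = sum(w for y, w in zip(labels, weights) if y == "s")
    sum_b = sum(w for y, w in zip(labels, weights) if y == "b")
    if sum_s <= 0 or sum_b <= 0:
        raise ValueError("each class needs a positive total weight")
    # One multiplicative factor per class keeps the within-class ratios intact.
    scale = {"s": n_s_expected / sum_s, "b": n_b_expected / sum_b}
    return [w * scale[y] for y, w in zip(labels, weights)]
```

Because only one factor per class is applied, the relative weights within a class, and hence the density ratios they encode, are unchanged.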

2 We use roman $\mathrm{s}$ to denote the label and in indices of terms related to signal (e.g., $n_\mathrm{s}$), and
italic $s$ for the estimated number of signal events selected by a classifier. The same logic applies to
the terms related to background.

3 We use small p for denoting probability densities and capital P for denoting the probability 
of random events. 



Let $g : \mathbb{R}^d \to \{\mathrm{b}, \mathrm{s}\}$ be an arbitrary classifier. Let the selection region $\mathcal{G} =
\{x : g(x) = \mathrm{s}\}$ be the set of points classified as signal, and let $\hat{\mathcal{G}}$ denote the
index set of points that $g$ selects (physics terminology) or classifies as signal
(machine learning terminology), that is, $\hat{\mathcal{G}} = \{i : x_i \in \mathcal{G}\} = \{i : g(x_i) = \mathrm{s}\}$.
Then from Eqs. (1) and (2) it follows that the quantity $s = \sum_{i \in \mathcal{S} \cap \hat{\mathcal{G}}} w_i$ is an
unbiased estimator of the expected number of signal events selected by $g$, and
$b = \sum_{i \in \mathcal{B} \cap \hat{\mathcal{G}}} w_i$ is an unbiased estimator of the expected number of background
events selected by $g$. In machine learning terminology, $s$ and $b$ are the (unnormalized)
true and false positive rates. Given a classifier $g$, the AMS objective function
used for the challenge is defined by


$$\mathrm{AMS} = \sqrt{2\left((s + b + b_\mathrm{reg}) \ln\left(1 + \frac{s}{b + b_\mathrm{reg}}\right) - s\right)} \tag{3}$$

where $b_\mathrm{reg}$ is a regularization term (e.g. $b_\mathrm{reg} = 10$). The derivation of this
formula is given in [9]. Quantitatively, a fluctuation is considered anomalous
by physicists (the evidence of a discovery) if it exceeds $5\sigma$, that is if AMS > 5,
which corresponds to a p-value of the one-sided Z-test of $3 \times 10^{-7}$.
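
As an illustration, the quantities $s$ and $b$ and the AMS of Eq. (3) can be computed as in the following sketch. The function names and the "s"/"b" label encoding are assumptions made for this example, not the official evaluation code. The last function evaluates the one-sided Gaussian tail probability, which makes the link between a significance of 5 and the quoted p-value explicit.

```python
import math

def selected_counts(labels, weights, predictions):
    """Weighted signal (s) and background (b) among events predicted as "s"."""
    s = sum(w for y, w, g in zip(labels, weights, predictions)
            if g == "s" and y == "s")
    b = sum(w for y, w, g in zip(labels, weights, predictions)
            if g == "s" and y == "b")
    return s, b

def ams(s, b, b_reg=10.0):
    """Approximate Median Significance of Eq. (3); requires b + b_reg > 0."""
    if s < 0 or b < 0:
        raise ValueError("s and b must be non-negative")
    radicand = 2.0 * ((s + b + b_reg) * math.log1p(s / (b + b_reg)) - s)
    # The radicand is non-negative in exact arithmetic; clip round-off error.
    return math.sqrt(max(radicand, 0.0))

def one_sided_p_value(z):
    """Upper-tail probability of a standard normal variable at z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

For instance, `one_sided_p_value(5.0)` evaluates to roughly $2.9 \times 10^{-7}$, consistent with the rounded value quoted above.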

3 Results of the challenge 

Organizing a challenge can be thought of as designing a large scale numerical
experiment in which research is “crowd-sourced”. Given a well-posed scientific
question, challenges are very effective in obtaining an answer. This has been the
case for the Higgs-Boson challenge. The challenge helped make both
qualitative and quantitative advances on the problem of optimizing the AMS.

First, from the quantitative point of view, the AMS of the top ten participants
ranged in [3.76, 3.80], while the benchmark (based on software widely used in
High Energy physics called TMVA 4) ranked only 782nd with an AMS of ~3.50.
This brings us closer to (albeit still far from) the target set forth by physicists
of AMS = 5. The winning solution of Gabor Melis [10] uses a deep learning
method (an ensemble of 70 fully connected 3-layer neural networks, with
600 hidden units per layer), confirming that deep learning methods can be very
competitive. Most top ranking participants, however, used ensembles of decision
trees, and particularly the XGBoost software 5. The ease of interpretability of the
model helped the crowwork team in feature construction and hyper-parameter
tuning to attain an AMS of 3.76 [11]. The major advantage of XGBoost
over straightforward gradient boosting lies in explicit regularization, reducing
the need for manual hyper-parameter tuning. However, the solution of Melis is
clearly more robust, as indicated by the fact that it dominated all other solutions
regardless of the choice of the threshold controlling the tradeoff between true
positive rate and false positive rate. This robustness may be attributable to
the use of a voting ensemble and the use of “drop-out”, a method consisting in
silencing neurons at random during training. The price to be paid for such
a deep learning solution is months of architecture and hyper-parameter tuning,
using a fast GPU-powered computer. In contrast, the boosted decision tree
solutions are obtained relatively fast (in minutes) with little need for
hyper-parameter tuning: the completely untuned XGBoost achieved ~3.64 on the test
set, ranking around 200th.

4 http://tmva.sourceforge.net

Second, several important qualitative (conceptual) advances were made. Most
of the participants (including the winner) followed a bi-level optimization
approach, optimizing a surrogate cost function (logistic loss or AUC) and adjusting
the cut-off classification threshold to optimize the AMS by cross-validation. For
the logistic loss, [12] provides a theoretical justification for the default approach
of post-fitting the cut-off, proving asymptotic consistency and bounding the rate of
convergence. However, some effort was put into finding other ways of optimizing
the AMS. For example, the 9th solution [13] uses a weighted version of the AUC
as a surrogate cost function, thus approximating the AMS with an additive loss.
Another principled approach to exploit all classical additive cost functions with
a simple sample reweighting has been proposed by [14], who applies a variational
method to optimize iteratively a set of linearized versions of the AMS.
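
The second level of the bi-level approach, adjusting the cut-off once a surrogate-trained model has produced scores, can be sketched as follows. This is an illustrative scan on a single labeled sample, not the procedure of any particular participant: the participants tuned the threshold by cross-validation, and the function names, the "s"/"b" label encoding and the selection rule score >= threshold are assumptions of this example. The weights are assumed to be already normalized for the sample at hand.

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate Median Significance; requires b + b_reg > 0."""
    radicand = 2.0 * ((s + b + b_reg) * math.log1p(s / (b + b_reg)) - s)
    return math.sqrt(max(radicand, 0.0))  # clip round-off below zero

def best_threshold(scores, labels, weights, b_reg=10.0):
    """Return (threshold, ams) maximizing the AMS of the rule score >= threshold."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    s = b = 0.0
    best_thr, best_ams = math.inf, 0.0  # start from the empty selection
    for rank, i in enumerate(order):
        # Lowering the cut-off past event i adds its weight to s or to b.
        if labels[i] == "s":
            s += weights[i]
        else:
            b += weights[i]
        # Tied scores cannot be separated by a cut: evaluate after the tie.
        if rank + 1 < len(order) and scores[order[rank + 1]] == scores[i]:
            continue
        value = ams(s, b, b_reg)
        if value > best_ams:
            best_thr, best_ams = scores[i], value
    return best_thr, best_ams
```

Sorting once and accumulating $s$ and $b$ keeps the scan at O(n log n). The AMS changes only when the cut-off crosses a sample, which is the per-sample discontinuity that makes direct gradient-based optimization difficult.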

We also had an interesting “negative” result. The promise that deep learning
methods can save on human effort by learning internal representations in place of
feature engineering did not hold in the challenge. In the winning solution, the
AMS drops by 13% if the derived (human engineered) features are not exploited.
A recent benchmark of deep and shallow networks also shows disappointing results
for deep learning on the Higgs to tau tau task [4], with a 6% performance drop.
Although the results for learning internal representations are more encouraging
for other high energy physics tasks, the ratio of gain to sample size indicates that
learning representations is extremely data demanding.

Another noteworthy development of the challenge concerns cross-validation.
All top ranking participants carefully avoided relying on the performance shown
on the public leaderboard (computed on too small a data sample to provide
a reliable performance evaluation). Rather, they repeated 10-fold
cross-validation multiple times and averaged the results. For computational efficiency,
they used the learning machines thus trained as part of a voting ensemble. An
interesting additional twist helped Melis win the challenge: since the AMS is
not an additive loss, it makes a difference whether one (i) computes the AMS on the
(very small) held-out sets, then averages the results, or (ii) collects statistics about
false positives and false negatives on the held-out sets, averages them, then computes
the AMS. The latter provides much smoother results.
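
The difference between the two averaging orders can be made concrete with a small sketch. It assumes that the per-fold statistics are the selected signal and background weights $s_k$ and $b_k$ of each held-out fold, already rescaled to the full-sample normalization; this choice of statistic and the function names are our own illustration and not a description of the winning code.

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate Median Significance; requires b + b_reg > 0."""
    radicand = 2.0 * ((s + b + b_reg) * math.log1p(s / (b + b_reg)) - s)
    return math.sqrt(max(radicand, 0.0))  # clip round-off below zero

def cv_ams_estimates(fold_counts, b_reg=10.0):
    """Compare the two averaging orders over held-out folds.

    fold_counts : list of (s_k, b_k) pairs, one per held-out fold
    Returns (mean of per-fold AMS, AMS of the mean counts).
    """
    k = len(fold_counts)
    if k == 0:
        raise ValueError("need at least one fold")
    # (i) compute the AMS on every fold, then average the AMS values
    mean_of_ams = sum(ams(s, b, b_reg) for s, b in fold_counts) / k
    # (ii) average the counts first, then compute a single AMS
    mean_s = sum(s for s, _ in fold_counts) / k
    mean_b = sum(b for _, b in fold_counts) / k
    return mean_of_ams, ams(mean_s, mean_b, b_reg)
```

The two estimates coincide when all folds have identical counts and differ otherwise, because the AMS is a nonlinear function of $s$ and $b$.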

In conclusion, the challenge revealed that participants who are skilled data
scientists with only elementary knowledge of physics could contribute to
significantly improving the power of the AMS using machine learning. Deep Learning is
a favorite technique but boosted decision trees are a strong contender, with
several practical advantages. Bi-level optimization using a surrogate cost function
beats direct optimization of the AMS, but variational methods, which tackle the
problem head on, provide a new avenue of research worth exploring further.